Abstract

We describe a framework for building abstraction
hierarchies whereby an agent alternates skill- and
representation-acquisition phases to construct a sequence
of increasingly abstract Markov decision processes. Our
formulation builds on recent results showing that the
appropriate abstract representation of a problem is
specified by the agent's skills. We describe how such a
hierarchy can be used for fast planning, and illustrate the
construction of an appropriate hierarchy for the Taxi domain.


Introduction 

One of the core challenges of artificial intelligence is that
of linking abstract decision-making to low-level, real-world
action and perception. Hierarchical reinforcement learning
methods (Barto and Mahadevan 2003) approach this problem
through the use of high-level temporally extended
macro-actions, or skills, which can significantly decrease
planning times (Sutton, Precup, and Singh 1999). Skill
acquisition (or skill discovery) algorithms, recently surveyed
by Hengst (2012), aim to discover appropriate high-level
skills autonomously. However, in most hierarchical
reinforcement learning research the state space does not change
once skills have been acquired. An agent that has acquired
high-level skills must still plan in its original low-level state
space, a potentially very difficult task when that space is
high-dimensional and continuous. Although some of the
earliest formalizations of hierarchical reinforcement learning
(Parr and Russell 1997; Dietterich 2000) featured hierarchies
where both the set of available actions and the state
space changed with the level of the hierarchy, there has been
almost no work on automating the representational aspects
of such hierarchies.

Recently, Konidaris, Kaelbling, and Lozano-Perez (2014)
considered the question of how to construct a symbolic
representation suitable for planning in high-dimensional
continuous domains, given a set of high-level skills. The key result
of that work was that the appropriate abstract representation
of the problem was directly determined by characteristics
of the skills available to the agent: the skills determine the
representation, and adding new high-level skills must result
in a new representation.

We show that these two processes can be combined into 
a skill-symbol loop: the agent acquires a set of high-level 
skills, then constructs the appropriate representation for 
planning using them, resulting in a new problem in which 
the agent can again perform skill acquisition. Repeating this 
process leads to a true abstraction hierarchy where both the 
available skills and the state space become more abstract at 
each level of the hierarchy. We describe the properties of the
resulting abstraction hierarchies and demonstrate the
construction and use of one such hierarchy in the Taxi domain.


Background 

Reinforcement learning problems are typically formalized
as Markov decision processes or MDPs, represented by a
tuple $M = (S, A, R, P, \gamma)$, where $S$ is a set of states, $A$ is
a set of actions, $R(s, a, s')$ is the reward the agent receives
when executing action $a$ in state $s$ and transitioning to state
$s'$, $P(s' \mid s, a)$ is the probability of the agent finding itself in
state $s'$ having executed action $a$ in state $s$, and $\gamma \in (0, 1]$ is
a discount factor.

We are interested in the multi-task reinforcement learning 
setting where, rather than solving a single MDP, the agent is 
tasked with solving several problems drawn from some task 
distribution. Each individual problem is obtained by adding 
a set of start and goal states to a base MDP that specifies 
the state and action spaces and background reward function. 
The agent’s task is to minimize the average time required to 
solve new problems drawn from this distribution. 


Hierarchical Reinforcement Learning 


Hierarchical reinforcement learning (Barto and Mahadevan
2003) is a framework for learning and planning using
higher-level actions built out of the primitive actions available to the
agent. Although other formalizations exist, most notably
the MAXQ (Dietterich 2000) and Hierarchy of Abstract
Machines (Parr and Russell 1997) approaches, we adopt
the options framework (Sutton, Precup, and Singh 1999),
which models temporally abstract macro-actions as options.

An option $o$ consists of three components: an option
policy, $\pi_o$, which is executed when the option is invoked; an
initiation set, $I_o = \{s \mid o \in O(s)\}$, which describes the states
in which the option may be executed; and a termination
condition, $\beta_o(s) \to [0, 1]$, which describes the probability that
an option will terminate upon reaching state $s$.

An MDP where primitive actions are replaced by a set
of possibly temporally extended options (some of which
could simply execute a single primitive action) is known
as a semi-Markov decision process (or SMDP), which
generalizes MDPs to handle action executions that may take
more than one time step. An SMDP is described by a
tuple $M = (S, O, R, P, \gamma)$, where $S$ is a set of states; $O$ is a
set of options; $R(s', \tau \mid s, o)$ is the reward received when
executing option $o \in O(s)$ at state $s \in S$, and arriving in state
$s' \in S$ after $\tau$ time steps; $P(s', \tau \mid s, o)$ is a PDF describing
the probability of arriving in state $s' \in S$, $\tau$ time steps after
executing option $o \in O(s)$ in state $s \in S$; and $\gamma \in (0, 1]$ is
a discount factor, as before.

The problem of deciding which options an agent should
acquire is known as the skill discovery problem. A skill
discovery algorithm must, through experience (and perhaps
additional advice or domain knowledge), acquire new options
by specifying their initiation set, $I_o$, and termination
condition, $\beta_o$. The option policy is usually specified indirectly via
an option reward function, $R_o$, which is used to learn $\pi_o$.
Each new skill is added to the set of options available to the
agent with the aim of either solving the original or
subsequent tasks more efficiently. Our framework is agnostic to
the specific skill discovery method used (many exist).

Representation Acquisition 

While skill acquisition allows an agent to construct
higher-level actions, it alone is insufficient for constructing a true
abstraction hierarchy because the agent must still plan in the
original state space, no matter how abstract its actions
become. A complementary approach is taken by recent work
on representation acquisition (Konidaris, Kaelbling, and
Lozano-Perez 2014), which considers the question of
constructing a symbolic description of an SMDP suitable for
high-level planning. Key to this is the definition of a symbol
as a name referring to a set of states:

Definition 1. A propositional symbol $\sigma_Z$ is the name of
a test $\tau_Z$, and corresponding set of states
$Z = \{s \in S \mid \tau_Z(s) = 1\}$.

The test, or grounding classifier, is a compact
representation of a (potentially uncountably infinite) set of states (the
grounding set). Logical operations (e.g., and) using the
resulting symbolic names have the semantic meaning of set
operations (e.g., $\cap$) over the grounding sets, which allows us
to reason about which symbols (and corresponding
grounding classifiers) an agent should construct in order to be able
to determine the feasibility of high-level plans composed of
sequences of options. We use the grounding operator $\mathcal{G}$ to
obtain the grounding set of a symbol or symbolic expression;
for example, $\mathcal{G}(\sigma_Z) = Z$, and
$\mathcal{G}(\sigma_A \text{ and } \sigma_B) = A \cap B$. For
convenience we also define $\mathcal{G}$ over collections of symbols; for a
set of symbols $A$, we define $\mathcal{G}(A) = \bigcup_i \mathcal{G}(a_i), \forall a_i \in A$.
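The correspondence between logical operations on symbols and set operations on grounding sets can be illustrated with a minimal sketch. The finite integer state space and the two tests `sigma_a` and `sigma_b` are hypothetical and chosen only for illustration; in the continuous setting the tests would be learned classifiers rather than enumerable predicates.

```python
from typing import Callable, FrozenSet, Iterable

State = int  # assumption: a small finite state space stands in for S
STATES: FrozenSet[State] = frozenset(range(10))

Test = Callable[[State], bool]


def ground(test: Test) -> FrozenSet[State]:
    """Grounding set of a symbol: every state for which its test holds."""
    return frozenset(s for s in STATES if test(s))


def ground_collection(tests: Iterable[Test]) -> FrozenSet[State]:
    """G over a collection of symbols: union of the individual groundings."""
    out: FrozenSet[State] = frozenset()
    for t in tests:
        out |= ground(t)
    return out


# Two hypothetical symbols, each named by a test (grounding classifier).
def sigma_a(s: State) -> bool:
    return s % 2 == 0  # "state is even"


def sigma_b(s: State) -> bool:
    return s >= 5  # "state is at least 5"


A = ground(sigma_a)
B = ground(sigma_b)

# The logical 'and' of two symbols grounds to the intersection A ∩ B.
a_and_b = ground(lambda s: sigma_a(s) and sigma_b(s))
assert a_and_b == A & B
```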

Konidaris, Kaelbling, and Lozano-Perez (2014) showed
that defining a symbol for each option's initiation set, together with the
symbols necessary to compute its image (the set of states the
agent might be in after executing the option from some set
of starting states), is necessary and sufficient for planning
using that set of options. The feasibility of a plan is
evaluated by computing each successive option's image, and then
testing whether it is a subset of the next option's initiation
set. Unfortunately, computing the image of an option is
intractable in the general case. However, the definition of the
image for two common classes of options is both natural and
computationally very simple.

The first is the subgoal option: the option reaches some
set of states and terminates, and the state it terminates in
can be considered independent of the state execution began
in. In this case we can create a symbol for that set (called
the effect set: the set of all possible states the option may
terminate in), and use it directly as the option's image. We
thus obtain $2n$ symbols for $n$ options (a symbol for each
option's initiation and effect sets), from which we can build
a plan graph representation: a graph with $n$ nodes, and an
edge from node $i$ to node $j$ if option $j$'s initiation set is a
superset of option $i$'s effect set. Planning amounts to finding
a path in the plan graph; once this graph has been computed,
the grounding classifiers can be discarded.
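The plan graph construction for subgoal options can be sketched as follows. This is an illustrative sketch rather than the implementation used in the cited work: initiation and effect sets are represented as explicit finite sets, the three options and their names are hypothetical, and breadth-first search is one of several ways to find a path.

```python
from collections import deque


def build_plan_graph(options):
    """options maps a name to (initiation_set, effect_set), both frozensets.

    An edge i -> j is added when option j's initiation set is a superset
    of option i's effect set, so j is always executable after i.
    """
    graph = {name: [] for name in options}
    for i, (_, effect_i) in options.items():
        for j, (init_j, _) in options.items():
            if effect_i <= init_j:
                graph[i].append(j)
    return graph


def find_option_path(graph, start, goal):
    """Breadth-first search over option nodes; returns a list of names or None."""
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        for nxt in graph[path[-1]]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None


# Hypothetical toy domain with states 0..5 and three subgoal options.
options = {
    "to_door": (frozenset({0, 1, 2}), frozenset({3})),
    "open": (frozenset({3}), frozenset({4})),
    "walk_in": (frozenset({4}), frozenset({5})),
}
graph = build_plan_graph(options)
path = find_option_path(graph, "to_door", "walk_in")
```

Once `graph` has been built, the sets in `options` are no longer consulted during search, which mirrors the observation that the grounding classifiers can be discarded.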

The second class consists of abstract subgoal options:
the low-level state is factored, and some variables are set to
a subgoal (again, independently of the starting state) while
others remain unchanged. The image operator can then be
computed using the intersection of the effect set (as in the
subgoal option case) and the starting state classifier with the
modified factors projected out. This results in a
STRIPS-like factored representation which can be automatically
converted to PDDL (McDermott et al. 1998) and used as
input to an off-the-shelf task planner. After this conversion the
grounding classifiers can again be discarded.
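The image computation for an abstract subgoal option can be sketched over a small finite factored state space. The representation below (states as tuples, one tuple position per factor, explicit per-factor domains) and the function names are assumptions made for illustration; with learned classifiers the projection and intersection would be carried out symbolically rather than by enumeration.

```python
def project_out(states, factor, domain):
    """Drop the constraint on one factor by allowing every value in its domain."""
    return {s[:factor] + (v,) + s[factor + 1:] for s in states for v in domain}


def image(start, effect, modified, domains):
    """Image of an abstract subgoal option executed from the set `start`.

    The factors the option modifies are projected out of the start set,
    and the result is intersected with the option's effect set.
    """
    projected = set(start)
    for f in modified:
        projected = project_out(projected, f, domains[f])
    return projected & set(effect)


# Hypothetical two-factor state (x, y) with x in {0, 1, 2} and y in {0, 1}.
domains = {0: range(3), 1: range(2)}
# A hypothetical option that sets x to 2 and leaves y unchanged.
effect_set = {(2, 0), (2, 1)}
result = image({(0, 1)}, effect_set, modified=[0], domains=domains)
```

Here the unmodified factor `y` keeps its starting value, while `x` takes the subgoal value regardless of where execution began.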

Constructing Abstraction Hierarchies 

These results show that the two fundamental aspects of
hierarchy (skills and representations) are tightly coupled:
skill acquisition drives representational abstraction. An
agent that has performed skill acquisition in an MDP to
obtain higher-level skills can automatically determine a new
abstract state representation suitable for planning in the
resulting SMDP. We now show that these two processes can
be alternated to construct an abstraction hierarchy.

We assume the following setting: an agent is faced with
some base MDP $M_0$, and aims to construct an abstraction
hierarchy that enables efficient planning for new problems
posed in $M_0$, each of which is specified by a start and goal
state set. $M_0$ may be continuous-state and even
continuous-action, but all subsequent levels of the hierarchy will be
constructed to be discrete-state and discrete-action. We adopt
the following definition of an abstraction hierarchy:

Definition 2. An $n$-level hierarchy on base MDP
$M_0 = (S_0, A_0, R_0, P_0)$ is a collection of MDPs
$M_i = (S_i, A_i, R_i, P_i)$, $i \in \{1, ..., n\}$, such that each action set $A_j$,
$0 < j \le n$, is a set of options defined over $M_{j-1}$ (i.e.,
$M^+_{j-1} = (S_{j-1}, A_j, R_{j-1}, P_{j-1})$ is an SMDP).

This captures the core assumption behind hierarchical
reinforcement learning: hierarchies are built through
macro-actions. Note that this formulation retains the downward
refinement property from classical hierarchical planning
(Bacchus and Yang 1991), meaning that a plan at level $j$ can be
refined to a plan at level $j - 1$ without backtracking to level $j$
or higher, because a policy at any level is also a (not
necessarily Markovian (Sutton, Precup, and Singh 1999)) policy
at any level lower, including $M_0$. However, while Definition
2 links the action set of each MDP to the action set of its
predecessor in the hierarchy, it says nothing about how to link
their state spaces. To do so, we must in addition determine
how to construct a new state space $S_j$, transition probability
function $P_j$, and reward function $R_j$.

(b) $S_1$: above-box-1 × above-box-2 × pregrasped × grasped × apple-in-box-1 × apple-in-box-2
(c) $S_2$: grabbed × above-box-1 × above-box-2 × apple-in-box-1 × apple-in-box-2
(d) $S_3$: apple-in-box-1 × apple-in-box-2

Figure 1: A robot must move an apple between two boxes (a). Given a set of motor primitives it can form a discrete, factored
state space (b). Subsequent applications of skill acquisition result in successively more abstract state spaces (c and d).

Fortunately, this is exactly what representation acquisition
provides: a method for constructing a new symbolic
representation suitable for planning in $M^+_{j-1}$ using the options
in $A_j$. This provides a new state space $S_j$, which,
combined with $A_j$, specifies $P_j$. The only remaining
component is the reward function. A representation construction
algorithm based on sets (Konidaris, Kaelbling, and
Lozano-Perez 2014), such as we adopt here, is insufficient for
reasoning about expected rewards, which requires a
formulation based on distributions (Konidaris, Kaelbling, and
Lozano-Perez 2015). For simplicity, we can remain
consistent and simply set the reward to a uniform transition penalty
of $-1$; alternatively, we can adopt just one aspect of the
distribution-based representation and set $R_j$ to the empirical
mean of the rewards obtained when executing each option.

Thus, we have all the components required to build level
$j$ of the hierarchy from level $j - 1$. This procedure can be
repeated in a skill-symbol loop (alternating skill acquisition
and representation acquisition phases) to construct an
abstraction hierarchy. It is important to note that there are no
degrees of freedom or design choices in the representation
acquisition phase of the skill-symbol loop; the algorithmic
questions reside solely in determining which skills to
acquire at each level.

This construction results in a specific relationship
between MDPs in a hierarchy: every state at level $j$ refers to a
set of states at level $j - 1$.¹ A grounding in $M_0$ can therefore
be computed for any state at level $j$ in the hierarchy by
applying the grounding operator $j$ times. If we denote this
"final grounding" operator as $\mathcal{G}_0$, then
$\forall j, s_j \in S_j, \exists Z_0 \subseteq S_0$
such that $\mathcal{G}_0(s_j) = Z_0$.
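Applying the grounding operator $j$ times can be sketched as repeated set expansion. The dictionary-based representation and the two-level toy hierarchy below are hypothetical; they assume every grounding set is finite and stored explicitly, which would not hold for a continuous base MDP.

```python
def final_grounding(state, level, grounding):
    """Ground a state at `level` down to a set of base (level-0) states.

    grounding[j] maps each state at level j to the set of level-(j-1)
    states it refers to; applying it `level` times yields the final
    grounding. A level-0 state grounds to itself.
    """
    current = {state}
    for j in range(level, 0, -1):
        expanded = set()
        for s in current:
            expanded |= grounding[j][s]
        current = expanded
    return frozenset(current)


# Hypothetical two-level hierarchy over base states 0..5.
grounding = {
    # Level-1 states "a" and "b" overlap on base state 1, so S_1 is
    # not a partition of S_0.
    1: {"a": {0, 1}, "b": {1, 2, 3}, "c": {4, 5}},
    2: {"x": {"a", "b"}, "y": {"c"}},
}
```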


¹Note that $S_{j+1}$ is not necessarily a partition of $S_j$: the
grounding sets of two states in $S_{j+1}$ may overlap.


We now illustrate the construction of an abstraction
hierarchy via an example: a very simple task that must be
solved by a complex agent. Consider a robot in a room
with two boxes, one containing an apple (Figure 1a). The
robot must occasionally move the apple from one box to the
other. Directly accomplishing this involves solving a
high-dimensional motion planning problem, so instead the robot
is given five motor skills: move-gripper-above1 and
move-gripper-above2 use motion planning to move the robot's
gripper above each box; pregrasp controls the gripper so
that it cages the apple, and is only executable from above
it; grasp can be executed following pregrasp, and runs a
gradient-descent based controller to achieve wrench-closure
on the apple; and release drops the apple. These form $A_1$, the
actions in the first level of the hierarchy, and since they are
abstract subgoal options the robot automatically constructs
a factored state space (see Figure 1b) that specifies $M_1$. This
enables abstract planning: the state space is independent of
the complexity of the robot, although $S_1$ contains some
low-level details (e.g., pregrasped).

Applying a skill discovery algorithm in $M_1$, the robot
detects that pregrasp is always followed by grasp, and
therefore replaces these actions with grab-apple, which together
with the remaining skills in $A_1$ forms $A_2$. This results in
a smaller MDP, $M_2$ (Figure 1c), which is a good abstract
model of the task. Applying a skill discovery algorithm to
$M_2$ creates a skill that picks up the apple in whichever box
it is in, and moves it over the other box. $A_3$ now consists of
just a single action, swap-apple, requiring just two
propositions to define $S_3$: apple-in-box-1 and apple-in-box-2
(Figure 1d). The abstraction hierarchy has abstracted away the
details of the robot (in all its complexity) and exposed the
(almost trivial) underlying task structure.

Planning Using an Abstraction Hierarchy 

Once an agent has constructed an abstraction hierarchy, it 
must be able to use it to rapidly find plans for new problems. 
We formalize this process as the agent posing a plan query to 
the hierarchy, which should then be used to generate a plan 
for solving the problem described by the query. We adopt 
the following definition of a plan query: 

Definition 3. A plan query is a tuple $(B, G)$, where
$B \subseteq S_0$ is the set of base MDP states from which execution may
begin, and $G \subseteq S_0$ (the goal) is the set of base MDP states
in which the agent wishes to find itself following execution.



















The critical question is at which level of the hierarchy
planning should take place. We first define a useful
predicate, planmatch, which determines whether an agent should
attempt to plan at level $j$ (see Figure 2):

Definition 4. A pair of abstract state sets $b$ and $g$ match
a plan query $(B, G)$ (denoted planmatch($b, g, B, G$)) when
$B \subseteq \mathcal{G}_0(b)$ and $\mathcal{G}_0(g) \subseteq G$.

Theorem 1. A plan can be found to solve plan query $(B, G)$
at level $j$ iff $\exists b, g \subseteq S_j$ such that planmatch($b, g, B, G$), and
there is a feasible plan in $M_j$ from every state in $b$ to some
state in $g$.

Proof. The MDP at level $j$ is constructed such that a plan
$p$ starting from any state in $\mathcal{G}(b)$ (and hence also $\mathcal{G}_0(b)$) is
guaranteed to leave the agent in a state in $\mathcal{G}(g)$ (and hence
also $\mathcal{G}_0(g)$) iff $p$ is a plan in MDP $M_j$ from $b$ to $g$
(Konidaris, Kaelbling, and Lozano-Perez 2014).

Plan $p$ is additionally valid from $B$ to $G$ iff $B \subseteq \mathcal{G}_0(b)$
(the start state at level $j$ refers to a set that includes all query
start states) and $\mathcal{G}_0(g) \subseteq G$ (the query goal includes all
states referred to by the goal at level $j$). □



Figure 2: The conditions under which a plan at MDP $M_j$
answers a plan query with start state set $B$ and goal state set
$G$ in the base MDP $M_0$. A pair of state sets $b, g \subseteq S_j$ are
required such that $B \subseteq \mathcal{G}_0(b)$, $\mathcal{G}_0(g) \subseteq G$, and a plan exists
in $M_j$ from every state in $b$ to some state in $g$.

Note that $b$ and $g$ may not be unique, even within a single
level: because $S_j$ is not necessarily a partition of $S_{j-1}$, there
may be multiple states, or sets of states, at each level whose
final groundings are included by $G$ or include $B$; a solution
from any such $b$ to any such $g$ is sufficient. For efficient
planning it is better for $b$ to be a small set to reduce the number
of start states while remaining large enough to subsume $B$;
if $b = S_j$ then answering the plan query requires a complete
policy for $M_j$, rather than a plan. However, finding a
minimal subset is computationally difficult. One approach is to
build the maximal candidate set
$b = \{s \mid \mathcal{G}_0(s) \cap B \neq \emptyset, s \in S_j\}$.
This is a superset of any start match, and a suitable one
exists at this level if and only if $B \subseteq \bigcup_{s \in b} \mathcal{G}_0(s)$. Similarly,
$g$ should be maximally large (and so easy to reach) while
remaining small enough so that its grounding set lies within
$G$. At each level $j$, we can therefore collect all states that
ground out to subsets of $G$: $g = \{s \mid \mathcal{G}_0(s) \subseteq G, s \in S_j\}$.
These approximations result in a unique pair of sets of states
at each level (at the cost of potentially including
unnecessary states in each set) and can be computed in time linear
in $|S_j|$.
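The two candidate sets can be computed with one pass over the states at a level, as in the sketch below. It assumes the final grounding of every abstract state is available as an explicit finite set in a dictionary `g0`, which is a simplification for illustration; the function names and the toy level are hypothetical.

```python
def candidate_start_set(g0, B):
    """Maximal candidate b: every state whose final grounding intersects B.

    Returns None when the groundings of the candidates do not cover B,
    in which case no start match exists at this level.
    """
    b = {s for s in g0 if g0[s] & B}
    covered = set().union(*(g0[s] for s in b))
    if not b or not B <= covered:
        return None
    return b


def candidate_goal_set(g0, G):
    """Maximal g: every state whose final grounding is a subset of G."""
    g = {s for s in g0 if g0[s] <= G}
    return g or None


# Hypothetical level with three abstract states over base states 0..5.
g0 = {
    "a": frozenset({0, 1}),
    "b": frozenset({1, 2, 3}),
    "c": frozenset({4, 5}),
}
b = candidate_start_set(g0, B={1, 2})
g = candidate_goal_set(g0, G={3, 4, 5})
```

In this toy level `b` contains both `"a"` and `"b"` even though `"b"` alone would subsume the query start set, which illustrates the cost of including unnecessary states.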

It follows from the state abstraction properties of the
hierarchy that a planmatch at level $j$ implies the existence of a
planmatch at all levels below $j$.

Theorem 2. Given a hierarchy of state spaces $\{S_0, ..., S_n\}$
constructed as above and plan query $(B, G)$, if $\exists b, g \subseteq S_j$
such that planmatch($b, g, B, G$), for some $j$, $n \ge j > 0$,
then $\exists b', g' \subseteq S_k$ such that planmatch($b', g', B, G$),
$\forall k \in \{0, ..., j - 1\}$.

Proof. We first consider $k = j - 1$. Let $b' = \mathcal{G}(b)$, and
$g' = \mathcal{G}(g)$. Both are, by definition, sets of states in $S_{j-1}$. By
definition of the final grounding operator, $\mathcal{G}_0(b) = \mathcal{G}_0(b')$
and $\mathcal{G}_0(g) = \mathcal{G}_0(g')$, hence $B \subseteq \mathcal{G}_0(b')$ and
$\mathcal{G}_0(g') \subseteq G$. This process can be repeated to reach any $k < j$. □

Any plan query therefore has a unique highest level $j$
containing a planmatch. This leads directly to Algorithm 1,
which starts looking for a planmatch at the highest level of
the hierarchy and proceeds downwards; it is sound and
complete by Theorem 1.

Input: MDP hierarchy $\{M_0, ..., M_n\}$, query $(B, G)$.

for $j \in \{n, ..., 0\}$ do
    for $\forall b, g \subseteq S_j$ s.t. planmatch($b, g, B, G$) do
        $\pi \leftarrow$ findplan($M_j, b, g$)
        if $\pi \neq$ null then
            return $(M_j, \pi)$
        end
    end
end
return null

Algorithm 1: A simple hierarchical planning algorithm.
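A minimal runnable sketch of Algorithm 1 follows. It makes several assumptions that are not part of the algorithm as stated: each level is a dictionary holding explicit final groundings (`g0`) and deterministic transitions (`edges`), only the single maximal candidate pair described earlier is tried at each level instead of every matching pair, and breadth-first search stands in for findplan. All names and the toy hierarchy are hypothetical.

```python
from collections import deque


def match(level, B, G):
    """Maximal candidate (b, g) pair for a level, or None if none exists."""
    g0 = level["g0"]
    b = {s for s in g0 if g0[s] & B}
    g = {s for s in g0 if g0[s] <= G}
    covered = set().union(*(g0[s] for s in b))
    if not b or not g or not B <= covered:
        return None
    return b, g


def find_plan(level, b, g):
    """Action sequence from each start state in b to some state in g.

    Returns None unless every start state can reach the goal set.
    """
    plans = {}
    for start in b:
        queue, seen, found = deque([(start, [])]), {start}, None
        while queue:
            state, actions = queue.popleft()
            if state in g:
                found = actions
                break
            for action, nxt in level["edges"].get(state, {}).items():
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append((nxt, actions + [action]))
        if found is None:
            return None
        plans[start] = found
    return plans


def hierarchical_plan(hierarchy, B, G):
    """Search levels n..0 for the highest level with a match and a plan."""
    for j in range(len(hierarchy) - 1, -1, -1):
        pair = match(hierarchy[j], B, G)
        if pair is None:
            continue
        plan = find_plan(hierarchy[j], *pair)
        if plan is not None:
            return j, plan
    return None


# Hypothetical hierarchy: a four-state chain and a two-state abstraction.
hierarchy = [
    {"g0": {s: frozenset({s}) for s in range(4)},
     "edges": {0: {"right": 1}, 1: {"right": 2}, 2: {"right": 3}}},
    {"g0": {"L": frozenset({0, 1}), "R": frozenset({2, 3})},
     "edges": {"L": {"go-right": "R"}}},
]
coarse = hierarchical_plan(hierarchy, B={0, 1}, G={2, 3})
fine = hierarchical_plan(hierarchy, B={0}, G={3})
```

The first query is answered at level 1, while the second names a goal more specific than any level-1 state and so falls through to the base level.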


The complexity of Algorithm 1 depends on its two
component algorithms: one used to find a planmatch, and
another to attempt to find a plan (possibly with multiple start
states and goals). We denote the complexity of these
algorithms as $m(|S|)$ (linear using the approach described
above) and $p(|S|)$, for a problem with $|S|$ states,
respectively. The complexity of finding a plan at level $l$, where the
first match is found at level $k \ge l$, is given by
$h(k, l, M) = \sum_{a=k+1}^{n} m(|S_a|) + \sum_{b=l}^{k} \left[ m(|S_b|) + p(|S_b|) \right]$,
for a hierarchy $M$ with $n$ levels. The first term corresponds to the search
for the level with the first planmatch; the second term for the
repeated planning at levels that contain a match but not a
plan (a planmatch does not necessarily mean a plan exists at
that level, merely that one could).
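The cost expression $h$ can be evaluated directly once the level sizes and the two cost functions are fixed. In the sketch below the linear matching cost and quadratic planning cost are placeholder assumptions chosen only to make the arithmetic concrete, and the level sizes are those of the Taxi hierarchy described later.

```python
def planning_cost(k, l, sizes, m, p):
    """Evaluate h(k, l, M) for a hierarchy with levels 0..n.

    sizes[a] is |S_a|; m and p give the matching and planning cost as a
    function of state-space size. The first match is found at level k
    and a plan is finally found at level l <= k.
    """
    n = len(sizes) - 1
    # Levels above k are searched for a match without success.
    search = sum(m(sizes[a]) for a in range(k + 1, n + 1))
    # Levels k down to l each incur a match and a planning attempt.
    attempts = sum(m(sizes[b]) + p(sizes[b]) for b in range(l, k + 1))
    return search + attempts


sizes = [650, 20, 4]          # |S_0|, |S_1|, |S_2|


def m(size):                  # assumed linear matching cost
    return size


def p(size):                  # assumed quadratic planning cost
    return size * size


best_case = planning_cost(2, 2, sizes, m, p)
worst_case = planning_cost(2, 0, sizes, m, p)
flat = p(sizes[0])
```

Under these assumed cost functions the worst case exceeds flat planning only by the cost of matching at every level and planning at the two small abstract levels.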

Discussion 

The formula for $h$ highlights the fact that hierarchies make
some problems easier to solve and others harder: in the
worst case, a problem that should take $p(|S_0|)$ time
(one only solvable via the base MDP) could instead take
$\sum_{a=0}^{n} \left[ m(|S_a|) + p(|S_a|) \right]$ time. A key question is therefore
how to balance the depth of the hierarchy, the rate at which
the state space size diminishes as the level increases, which
specific skills to discover at each level, and how to control
false positive plan matches, to reduce planning time.


Recent work has highlighted the idea that skill discovery
algorithms should aim to reduce average planning or
learning time across a target distribution of tasks (Şimşek and
Barto 2008; Solway et al. 2014). Following this logic, a
hierarchy $M$ for some distribution over task set $T$ should
be constructed so as to minimize $\int_T h(k(t), l(t), M) P(t)\,dt$,
where $k$ and $l$ now both depend on each task $t$.
Minimizing this quantity over the entire distribution seems
infeasible; an acceptable substitute may be to assume that the tasks
the agent has already experienced are drawn from the same
distribution as those it will experience in the future, and to
construct the hierarchy that minimizes $h$ averaged over past
tasks.
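Written out, the substitute objective is a sample average over previously experienced tasks. The symbols $N$ (the number of past tasks), $t_1, \dots, t_N$ (the tasks themselves), and $\hat{h}$ are notation introduced here for illustration, and the approximation holds only under the stated assumption that past and future tasks share a distribution.

```latex
\hat{h}(M) \;=\; \frac{1}{N} \sum_{i=1}^{N} h\big(k(t_i),\, l(t_i),\, M\big)
\;\approx\; \int_T h\big(k(t),\, l(t),\, M\big)\, P(t)\, dt
```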


The form of $h$ suggests two important principles which
may aid the more direct design of skill acquisition
algorithms. One is that deeper hierarchies are not necessarily
better; each level adds potential planning and matching costs,
and must be justified by a rapidly diminishing state space
size and a high likelihood of solving tasks at that level.
Second, false positive plan matches (when a pair of states that
match the query is found at some level at which a plan
cannot be found) incur a significant time penalty. The
hierarchy should therefore ideally be constructed so that every
likely goal state at each level is reachable from every likely
start state at that level.


An agent that generates its own goals (as a completely
autonomous agent would) could do so by selecting an
existing state from an MDP at some level (say $j$) in the
hierarchy. In that case it need not search for a matching level, and
could instead immediately plan at level $j$, though it may still
need to drop to lower levels if no plan is found in $M_j$.


An Example Domain: Taxi 


We now explain the construction and use of an abstraction
hierarchy for a common hierarchical reinforcement learning
benchmark: the Taxi domain (Dietterich 2000), depicted in
Figure 3a. A taxi must navigate a 5 × 5 grid, which contains
a few walls, four depots (labeled red, green, blue, and
yellow), and a passenger. The taxi may move one square in each
direction (unless impeded by a wall), pick up a passenger
(when occupying the same square), or drop off a passenger
(when it has previously picked the passenger up). A state at
base MDP $M_0$ is described by 5 state variables: the x and y
location of the taxi and the passenger, and whether or not the
passenger is in the taxi. This results in a total of 650 states
(25 × 25 = 625 states for when the passenger is not in the
taxi, plus another 25 for when the passenger is in the taxi
and they are constrained to have the same location).
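The state count can be checked by enumerating the five state variables directly. The tuple layout `(taxi_x, taxi_y, pass_x, pass_y, in_taxi)` is an assumed encoding used only for this check.

```python
from itertools import product

GRID = range(5)
cells = list(product(GRID, GRID))  # the 25 grid squares

# Assumed encoding: (taxi_x, taxi_y, pass_x, pass_y, in_taxi).
base_states = [
    (tx, ty, px, py, in_taxi)
    for (tx, ty), (px, py), in_taxi in product(cells, cells, (False, True))
    # When the passenger is in the taxi, both must share a location.
    if not in_taxi or (tx, ty) == (px, py)
]

not_in_taxi = sum(1 for s in base_states if not s[4])
in_taxi_count = sum(1 for s in base_states if s[4])
```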


We now describe the construction of a hierarchy for the
Taxi domain using hand-designed options at each level, and
present some results for planning using Algorithm 1 for
three example plan queries.



Figure 3: The Taxi Domain (a), and its induced 3-level
hierarchy (b). The base MDP contains 650 states (shown in red),
which is abstracted to an MDP with 20 states (green) after
the first level of options, and one with 4 states (blue) after the
second. At the base level, the agent makes decisions about
moving the taxi one step at a time; at the second level, about
moving the taxi between depots; at the third, about moving
the passenger between depots.


Constructing $M_1$. In this version of Taxi, the agent is able
to move the taxi to, and drop the passenger at, any square,
but it expects to face a distribution of problems generated by
placing the taxi and the passenger at a depot at random, and
selecting a random target depot at which the passenger must
be deposited. Consequently, we create navigation options for
driving the taxi to each depot, and retain the existing
put-down and pick-up options.² These options over $M_0$ form the
action set for level 1 of the hierarchy: $A_1$ = {drive-to-red,
drive-to-green, drive-to-blue, drive-to-yellow, pick-up,
put-down}.

Consider the drive-to-blue option. It is executable
in all states (i.e., its initiation set is $S_0$), and terminates with
the taxi's x and y position set to the position of the blue
depot; if the passenger is in the taxi, their location is also set
to that of the blue depot; otherwise, their location (and the
fact that they are not in the taxi) remains unchanged. It can
therefore be partitioned into two abstract subgoal options:
one, when the passenger is in the taxi, sets the x and y
positions of the taxi and passenger to those of the blue depot;
another, when the passenger is not in the taxi, sets the taxi
x and y coordinates and leaves those of the passenger
unchanged. Both leave the in-taxi state variable unmodified.
Similarly, the put-down and pick-up options are executable
everywhere and when the taxi and passenger are in the same
square, respectively, and modify the in-taxi variable while
leaving the remaining variables the same. Partitioning all
options in $A_1$ into abstract subgoal options results in a factored
state space consisting of 20 reachable states where the taxi
and passenger are at the depot locations (4 × 4 states for when
the passenger is not in the taxi, plus 4 for when they are).

²These roughly correspond to the hand-designed hierarchical
actions used in Dietterich (2000).

                 Hierarchical Planning
Query  Level  Matching  Planning  Total     Base + Options  Base MDP
1      2      <1        <1        <1        770.42          1423.36
2      1      <1        10.55     11.1      1010.85         1767.45
3      0      12.36     1330.38   1342.74   1174.35         1314.94

Figure 4: Timing results for three example queries in the Taxi domain. The final three columns compare the total time for
planning using the hierarchy, by planning in the SMDP obtained by adding all options into the base MDP (i.e., using options
but not changing the representation), and by flat planning in the base MDP. All times are in milliseconds and are averaged over
100 samples, obtained using a Java implementation run on a MacBook Air with a 1.4 GHz Intel Core i5 and 8 GB of RAM.
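The count of 20 reachable states at level 1 can be reproduced by closing the set of depot start states under the six options. This sketch abstracts grid coordinates to depot names and encodes a state as `(taxi, passenger, in_taxi)`; both choices, and the function names, are assumptions made for illustration.

```python
DEPOTS = ("red", "green", "blue", "yellow")


def drive_to(depot, state):
    """Drive option: the passenger moves only when already in the taxi."""
    _, passenger, in_taxi = state
    return (depot, depot if in_taxi else passenger, in_taxi)


def pick_up(state):
    """Executable only when taxi and passenger share a location."""
    taxi, passenger, _ = state
    return (taxi, passenger, True) if taxi == passenger else None


def put_down(state):
    """Executable everywhere; clears the in-taxi variable."""
    taxi, passenger, _ = state
    return (taxi, passenger, False)


def reachable_states():
    """Close the depot start states under all six level-1 options."""
    frontier = [(t, p, False) for t in DEPOTS for p in DEPOTS]
    seen = set(frontier)
    while frontier:
        state = frontier.pop()
        successors = [drive_to(d, state) for d in DEPOTS]
        successors += [pick_up(state), put_down(state)]
        for nxt in successors:
            if nxt is not None and nxt not in seen:
                seen.add(nxt)
                frontier.append(nxt)
    return seen


level1_states = reachable_states()
```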

Constructing $M_2$. Given $M_1$, we now build the second
level of the hierarchy by constructing options that pick up
the passenger (wherever they are), move them to each of
the four depots, and drop them off. These options become
$A_2$ = {passenger-to-blue, passenger-to-red,
passenger-to-green, passenger-to-yellow}. Each option is executable
whenever the passenger is not already at the relevant depot,
and it leaves the passenger and taxi at the depot, with the
passenger outside the taxi. Since these are subgoal (as
opposed to abstract subgoal) options, the resulting MDP, $M_2$,
consists of only 4 states (one for each location of the
passenger) and is a simple (and coincidentally fully connected)
graph. The resulting hierarchy is depicted in Figure 3b.

We used the above hierarchy to compute plans for three
example queries, using dynamic programming and decision
trees for planning and grounding classifiers, respectively.
The results are given in the table of Figure 4; we next present
each query, and step through the matching process in detail.

Example Query 1. Query $Q_1$ has the passenger start at
the blue depot (with the taxi at an unknown depot) and
request to be moved to the red depot. In this case $B_1$ refers
to all states where the passenger is at the blue depot and the
taxi is located at one of the four depots, and $G_1$ similarly refers
to the red depot. The agent must first determine the
appropriate level to plan at, starting from $M_2$, the highest level
of the hierarchy. It finds a state $s_b$ where $\mathcal{G}_0(s_b) = B_1$ (and
therefore $B_1 \subseteq \mathcal{G}_0(s_b)$ holds), and $s_r$ where $\mathcal{G}_0(s_r) = G_1$
(and therefore $\mathcal{G}_0(s_r) \subseteq G_1$), where $s_b$ and $s_r$ are the states
in $M_2$ referring to the passenger being located at the blue
and red depots, respectively. Planning therefore consists of
finding a plan from $s_b$ to $s_r$ at level $M_2$; this is virtually
trivial (there are only four states in $M_2$ and the state space is
fully connected).

Example Query 2. Query $Q_2$ has the start state set as
before, but now specifies a goal depot (the yellow depot)
for the taxi. $B_2$ refers to all states where the passenger is at
the blue depot and the taxi is at an unknown depot, but $G_2$
refers to a single state. $M_2$ contains a state that has the same
grounding set as $B_2$, but no state in $M_2$ is a subset of $G_2$
because no state in $M_2$ specifies the location of the taxi. The
agent therefore cannot find a planmatch for $Q_2$ at level $M_2$.

At $M_1$ no single state is a superset of $B_2$, but the agent
finds a collection of states $s_j$ such that $\mathcal{G}_0(\bigcup_j s_j) = B_2$.
It also finds a single state with the same grounding as $G_2$.
Therefore, it builds a plan at level $M_1$ for each state $s_j$.

Example Query 3. In query $Q_3$, the taxi begins at the red
depot and the passenger at the blue depot, and its goal is to
leave the passenger at grid location (1, 4), with the taxi goal
location left unspecified. The start set, $B_3$, refers to a single
state, and the goal set, $G_3$, refers to the set of states where
the passenger is located at (1, 4).

Again the agent starts at $M_2$. $B_3$ is a subset of the
grounding of the single state in $M_2$ where the passenger is at the
blue depot but the taxi is at an unknown depot. However, $G_3$
is not a superset of any of the states in $M_2$, since none
contain any state where the passenger is not at a depot.
Therefore the agent cannot plan for $Q_3$ at level $M_2$.

At level $M_1$, it again finds a state that is a superset of $B_3$,
but no state that is a subset of $G_3$: all states in $M_1$ now
additionally specify the position of the taxi and passenger,
but like the states in $M_2$ they all fix the location of the
passenger at a depot. All state groundings are in fact disjoint
from the grounding of $G_3$. The agent must therefore resort
to planning in $M_0$, and the hierarchy does not help (indeed,
it results in a performance penalty due to the compute time
to rule out $M_1$ and $M_2$).

Summary 

We have introduced a framework for building abstraction
hierarchies by alternating skill- and representation-acquisition
phases. The framework is completely automatic except for
the choice of skill acquisition algorithm, to which our
formulation is agnostic. The resulting hierarchies combine
temporal and state abstraction to realize efficient planning and
learning in the multi-task setting.